Normalize? Encode? Impute? Let’s Make It Simple

“We don’t have better algorithms. We just have more data — and we clean it better.”
Peter Norvig, Director of Research at Google, said this, and back then I didn’t really get it. I thought cleaning data just meant removing a few empty rows or fixing a typo here and there. It didn’t sound that important.

The first time I trained a machine learning model, I used the Titanic dataset. It performed really well, and I felt proud. But I didn’t realize that most of the hard work had already been done for me. The data was already clean and well-prepared. I didn’t know how much that mattered.

Then I tried building another model, this time using real-world user click data. And it was a disaster. The model failed badly, no matter how much I tuned or changed things. I kept wondering what I was doing wrong.

After spending a lot of time and effort, I finally understood the problem. The dataset was messy. It had missing values, strange categories, and numbers that were way off balance. I had skipped over all of that, thinking I could just jump straight into building the model. But it doesn’t work like that.

That’s when I realized how important data preprocessing really is. It’s not just a step to get over with. It’s the part that decides how well everything else will go.

Let's walk through a clear and organized way to do preprocessing, and how to choose the right techniques depending on your data. Along the way, let's answer the questions I wish I had asked earlier, like:

What’s the difference between normalization and standardization?
How should I handle missing values?
When should I use one-hot encoding or label encoding?

If you're working with data or planning to start, this will help you avoid the mistakes I made in the beginning.

Mastering the Art of Data Preprocessing: Choosing the Right Techniques for the Right Data


Peter Norvig, Director of Research at Google, once said: "We don’t have better algorithms. We just have more data — and we clean it better."

When I first came across this quote, I didn’t fully understand its weight. I thought data cleaning was a basic chore. Just remove some empty rows, fix a few typos, and you’re good to go, right? Not quite.

My journey into machine learning began with the famous Titanic dataset. I trained my first model on it, and it performed surprisingly well. I felt confident, even excited. What I didn’t realize at the time was that most of the data preprocessing had already been done. The dataset was clean, balanced, and ready for modeling.

Encouraged by my initial success, I decided to tackle a more complex dataset: user clickstream data from a real-world source. This time, things didn't go well. My model failed to perform, no matter how many parameters I tuned or how much time I spent optimizing the algorithm. I was frustrated, confused, and stuck.

The Turning Point

After weeks of trial and error, I took a step back and re-examined the data. That’s when I discovered the real issue. The dataset was filled with missing values, inconsistent categories, and wildly skewed distributions. And I had done almost nothing to address those problems.

That experience was my wake-up call. It taught me that data preprocessing isn’t a task to rush through. It’s a crucial foundation for every machine learning model you build. A well-preprocessed dataset can make even a simple model shine. A poorly prepared dataset can make even the most powerful algorithms fail.

So in this blog, let's go through a clear, structured way to approach data preprocessing. Not just what to do, but how to think about it. And more importantly, how to decide which techniques to use depending on your data.

Step-by-Step Guide to Data Preprocessing

Start by knowing your data well. Before jumping into any tools or models, you need to sit with the data and understand what you are dealing with. Imagine you're handed a notebook filled with numbers, words, and blanks. What should be your first move? Let’s go step by step.

Understand What You’re Working With
When you first get a dataset, the best thing you can do is take a look around. Ask yourself, what kind of data is this? You might find numbers, categories, or even full sentences. Some data might be missing altogether.

A good way to begin is to check how many columns and rows there are. Then look at what each column represents. Is it age, city name, or a product description? Each type of data will need a different kind of care.

A simple example: Suppose you have a file with house listings. You’ll likely find prices, locations, number of bedrooms, and maybe a short description. Price and bedrooms are numbers. Location is a category. Description is text. These differences matter because the way you treat numbers is different from how you treat text or categories.
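If you want to see what this first look can be in code, here is a minimal sketch using pandas. The tiny house-listings table below is made up just to mirror the example above.

```python
import pandas as pd

# A tiny made-up version of the house-listings example above.
df = pd.DataFrame({
    "price": [250000, 410000, 180000],
    "bedrooms": [3, 4, 2],
    "location": ["Karachi", "Lahore", "Karachi"],
    "description": ["Cozy flat", "Family home", "Near the park"],
})

print(df.shape)       # (rows, columns)
print(df.dtypes)      # numeric vs. object (text/category) columns
print(df.head())      # peek at the first few rows
df.info()             # non-null counts per column
print(df.describe())  # summary statistics for the numeric columns
```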

Handling Missing Values
Sometimes you’ll open a dataset and find blank spaces. Maybe a few people didn’t fill in their age. Or a city name is missing. You need to decide what to do. There are different ways to handle missing data:

If a large part of a column is missing, it's often better to drop that column entirely. Imagine you have 1000 rows and 900 of them are missing values in the "monthly income" column. That data won’t help much and could mislead the model.

If only a few values are missing, you can fill them in. This is called imputation. The method you choose depends on the kind of data you have and whether it contains outliers.

If the values are numerical and there are no big outliers, you can use the mean. For example, if you’re missing two values for a student's marks out of 100, and the rest of the marks are around 60 to 70, using the average score makes sense.

But if there are outliers, like one student scoring 5 and another scoring 100 while the rest score around 70, the mean will be pulled up or down. In that case, use the median instead, which is the middle value. The median is not affected by extremely high or low values.

For categorical data, you can use the most frequent value. For example, if most people in your dataset are from Karachi, and some city names are missing, you can fill them with Karachi.

You can also add a new category like "Unknown" for missing values if the absence might be meaningful. For example, if the profession is missing, maybe the person didn’t want to share it, which itself tells something.

Important note: if you impute too many missing values with the same method (like the mean or mode), your dataset can lose its natural variation, which makes it misleading or biased. For example, if you replace every missing income value with the mean income, you’ll make your data look more consistent than it really is.
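Here is a small sketch of these options in pandas. The column names (marks, city, monthly_income) and the 80 percent threshold for dropping a column are assumptions made up to match the examples above, not fixed rules.

```python
import pandas as pd

# Hypothetical dataset with gaps, matching the examples above.
df = pd.DataFrame({
    "marks": [65, 70, None, 68, None, 72],
    "city": ["Karachi", None, "Karachi", "Lahore", "Karachi", None],
    "monthly_income": [None, None, None, None, None, 50000],  # mostly missing
})

# 1. Drop a column that is mostly empty (threshold chosen for illustration).
if df["monthly_income"].isna().mean() > 0.8:
    df = df.drop(columns=["monthly_income"])

# 2. Numeric column without big outliers: fill with the mean.
df["marks"] = df["marks"].fillna(df["marks"].mean())
# If outliers were present, the median would be the safer choice:
# df["marks"] = df["marks"].fillna(df["marks"].median())

# 3. Categorical column: fill with the most frequent value...
df["city"] = df["city"].fillna(df["city"].mode()[0])
# ...or keep the absence as its own signal:
# df["city"] = df["city"].fillna("Unknown")

print(df)
```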

Understanding the Types of Data
Data comes in different forms. Some are continuous numbers, like height or weight. Some are categories, like city names or shirt sizes. Some categories are ordered, like "small," "medium," and "large." Others are unordered, like "red," "blue," and "green."

Models can't work directly with text, so you need to convert it into numbers. This is called encoding. If your categories don’t have any natural order, like types of fruits (apple, banana, mango), you can use one-hot encoding. This creates a new column for each fruit and puts a one in the column that matches and zeros in the rest.

If your categories are ordered, like experience levels (beginner, intermediate, expert), you can use ordinal (label) encoding, where you assign 0, 1, and 2 in that order. This keeps the order and helps models understand the progression.
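A quick sketch of both encodings in pandas. The fruit and experience columns are invented to match the examples above.

```python
import pandas as pd

df = pd.DataFrame({
    "fruit": ["apple", "banana", "mango", "apple"],                    # unordered
    "experience": ["beginner", "expert", "intermediate", "beginner"],  # ordered
})

# One-hot encoding for unordered categories: one 0/1 column per fruit.
df = pd.get_dummies(df, columns=["fruit"])

# Ordinal encoding for ordered categories: map each level to its rank.
order = {"beginner": 0, "intermediate": 1, "expert": 2}
df["experience"] = df["experience"].map(order)

print(df)
```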

Rescale the Numbers
Sometimes different columns have values on very different scales. For example, income might be in thousands, while age is just a two-digit number. If you use these directly, the model might think income is more important simply because the numbers are bigger.

To fix this, you can rescale your numbers. Normalization rescales values between 0 and 1. This is useful when your data doesn't follow a normal distribution. Standardization adjusts values so they have a mean of 0 and a standard deviation of 1. This is good if your data is roughly bell-shaped or normally distributed.

Example: Suppose you have height in centimeters and weight in kilograms. If you're using a model like k-nearest neighbors, which depends on distance, you should normalize or standardize. Otherwise, the model will focus more on the larger numbers, like height in centimeters.
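Here is how that might look with scikit-learn's MinMaxScaler and StandardScaler, using made-up height and weight values.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical height/weight data on very different scales.
df = pd.DataFrame({
    "height_cm": [150, 165, 180, 172, 190],
    "weight_kg": [50, 68, 85, 70, 95],
})

# Normalization: rescale each column to the 0-1 range.
normalized = pd.DataFrame(
    MinMaxScaler().fit_transform(df), columns=df.columns
)

# Standardization: mean 0 and standard deviation 1 per column.
standardized = pd.DataFrame(
    StandardScaler().fit_transform(df), columns=df.columns
)

print(normalized.round(2))
print(standardized.round(2))
```

In a real project you would fit the scaler on the training data only, then reuse it to transform the validation and test sets, so no information leaks from data the model is not supposed to see.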

Look Out for Outliers
Outliers are values that are far away from the others. For example, if most house prices are around 2 million, but one is listed as 200 million, that’s an outlier. Outliers can come from errors, like a typo, or real situations, like a luxury house. Either way, they can affect your model badly, especially if you're calculating the mean or using linear models.

You can detect outliers by plotting your data or using simple rules. For example, if a value is far beyond the typical range (like more than 1.5 times the interquartile range), it may be an outlier. Once detected, you have options:

You can remove the outlier if it's clearly an error. You can cap it, which means replacing it with a maximum limit. Or you can log-transform the data to reduce the impact of large values.

Example: Suppose you have the ages of users, and one user is listed as 150 years old. That’s probably a mistake. You might remove that row or cap it at a maximum reasonable age like 100.
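A small sketch of the interquartile-range rule plus the two treatments mentioned above, using invented ages.

```python
import numpy as np
import pandas as pd

# Hypothetical user ages, with one obviously wrong entry.
ages = pd.Series([22, 25, 31, 28, 35, 40, 150])

# IQR rule: flag values beyond 1.5 * IQR from the quartiles.
q1, q3 = ages.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]
print("Outliers:", outliers.tolist())

# Option 1: cap extreme values at a reasonable maximum.
capped = ages.clip(upper=100)

# Option 2: log-transform to shrink the influence of large values.
logged = np.log1p(ages)

print(capped.tolist())
print(logged.round(2).tolist())
```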

Create New Useful Data
You don’t always have to work with what you're given. Sometimes you can create new features that help your model.

For example, if you have a column with date of birth, you can calculate age. If you have total bill and number of items, you can create a column for average price per item. This is called feature engineering, and it can be a powerful way to give your model more useful information.

Example: In an online shopping dataset, instead of using only the "number of visits" and "number of purchases," you can create a new column called "conversion rate" by dividing purchases by visits.
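Here is a minimal sketch of both derived features, with made-up values. The age calculation is approximate (it divides days by 365) just to keep the example short.

```python
import pandas as pd

# Hypothetical online-shopping data.
df = pd.DataFrame({
    "date_of_birth": pd.to_datetime(["1995-04-12", "1988-09-30", "2001-01-05"]),
    "visits": [40, 10, 25],
    "purchases": [4, 0, 5],
})

# Approximate age in whole years, derived from date of birth.
today = pd.Timestamp.today()
df["age"] = (today - df["date_of_birth"]).dt.days // 365

# Conversion rate = purchases per visit.
df["conversion_rate"] = df["purchases"] / df["visits"]

print(df)
```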

Split the Data
Before training your model, split your data. This helps you understand how your model performs on new, unseen data. Usually, you divide it into training data and testing data. You can also create a third set, called the validation set, to fine-tune your model.

Example: Suppose you have 1000 rows. You can use 700 for training, 150 for validation, and 150 for testing. This way, you can build, improve, and then fairly evaluate your model.
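One common way to get that 700/150/150 split is to call scikit-learn's train_test_split twice, as in this sketch with a dummy 1000-row dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset of 1000 rows with one feature and one target column.
df = pd.DataFrame({"feature": range(1000), "target": [i % 2 for i in range(1000)]})

# First split off 70% for training...
train, temp = train_test_split(df, test_size=0.3, random_state=42)
# ...then split the remaining 30% evenly into validation and test sets.
val, test = train_test_split(temp, test_size=0.5, random_state=42)

print(len(train), len(val), len(test))  # 700, 150, 150
```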

Final Words
Data preprocessing is not about deleting or fixing random things. It’s about understanding your data, cleaning it gently, and preparing it carefully. A good model needs good data. If you don’t handle your data properly, even the best model won’t perform well.

Take time to explore, ask questions, and think. Learn how data types behave, how to handle missing or unusual values, and how to create new signals from old ones. Even if you are just starting, learning to respect your data is the first big step in building great projects.
